{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "“he is in fact preparing a rejection at present, because putin is of course scared to tell president trump that he wants to continue this war, that he wants to kill ukrainians,” zelensky said. “that’s why in moscow they are imposing upon the idea of a ceasefire these conditions, so that nothing happens at all, or so that it cannot happen for as long as possible.”\n",
      "\n",
      "zelensky said washington had expressed readiness to organise control and verification of the ceasefire. “this is possible to ensure, with america’s possibilities, and europe’s possibilities,” he added.\n",
      "\n",
      "the president said a ceasefire would give time “to prepare answers to all questions regarding long-term security and a real, reliable peace, and put on the table a plan to end the war”.\n",
      "\n",
      "trump, meanwhile, hosted the nato secretary general, mark rutte, at the white house where he acknowledged putin’s “promising” remarks, but warned that negotiations were far from complete. speaking to reporters, trump said that he would love to meet putin. “but we have to get it over with fast,” he said. “hopefully russia will do the right thing.”\n",
      "\n",
      "trump said washington was receiving “good signals” from the kremlin. he added that discussions with ukraine centred on “pieces of land that would be kept and lost”. kyiv, however, insisted it would not recognise russian rights over ukrainian territory.\n",
      "\n",
      "“we agree with the proposals to end military hostilities but we must proceed from the fact that the cessation must be one that leads to a long-lasting peace and eliminates the initial reasons for this crisis,” putin said at a press conference with his staunch ally, alexander lukashenko of belarus.his remarks came as moscow pressed its advantage on the battlefield, retaking territory in the kursk region where ukraine has been waging a counter-incursion. russian forces had wrested back control of the town of sudzha, the russian defence ministry claimed, a day after putin visited the area.\n",
      "\n",
      "troops had also liberated two nearby villages, podol and melovoy, the ministry said, with a major-general predicting the operation to “cleanse our territory from the enemy” would be completed within days.\n",
      "\n",
      "with about 6,000 inhabitants, sudzha was the most significant settlement seized by ukraine during its advance into the region, which began swiftly in august 2024. valery gerasimov, chief of the general staff, told putin on wednesday that in the previous five days russian forces had regained 24 settlements from the ukrainians, the defence ministry added.\n",
      "\n",
      "gerasimov also told the president that 600 russian soldiers had walked 15km through an empty gas pipeline in order to attack ukrainian positions. ukraine did not confirm the loss of sudzha but independent analysts reported that russia had partially or fully seized the town.\n"
     ]
    }
   ],
   "source": [
    "# 从多个文件读取数据\n",
    "def load_text_data(file_paths):\n",
    "    text_data = \"\"\n",
    "    for file in file_paths:\n",
    "        with open(file, 'r', encoding='utf-8') as f:\n",
    "            text_data += f.read().lower()  # 统一小写\n",
    "    return text_data\n",
    "\n",
    "# 你可以使用多个数据集文件\n",
    "file_paths = [\"data1.txt\", \"data2.txt\"]\n",
    "text = load_text_data(file_paths)\n",
    "print(text)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "编码: ['he', 'll', 'o', 'w', 'or', 'ld']\n",
      "解码: he ll o w or ld\n"
     ]
    }
   ],
   "source": [
    "from tokenizers import Tokenizer, models, trainers, pre_tokenizers\n",
    "\n",
    "# 创建 Tokenizer\n",
    "tokenizer = Tokenizer(models.BPE())\n",
    "\n",
    "# 训练 Tokenizer\n",
    "trainer = trainers.BpeTrainer(special_tokens=[\"<pad>\", \"<bos>\", \"<eos>\"])\n",
    "tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()\n",
    "tokenizer.train([\"data1.txt\", \"data2.txt\"], trainer)\n",
    "\n",
    "# 测试编码 & 解码\n",
    "encoded = tokenizer.encode(\"hello world\")\n",
    "print(\"编码:\", encoded.tokens)\n",
    "print(\"解码:\", tokenizer.decode(encoded.ids))\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
